This post is not a tutorial but rather a collection of notes that I have taken while learning pandas. I will try to keep it updated as I learn more about pandas. I don’t want to write a tutorial because I want to keep it short and simple and mention only the features that I use most.
What is pandas?
pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. (Source: Wikipedia)
Pandas has become the de facto standard for data analysis in Python. It is a powerful library that provides high-performance, easy-to-use data structures and data analysis tools. It is built on top of NumPy and uses Matplotlib for plotting.
Installation
A good practice is to create a virtual environment for each project. This will allow you to install the required packages without affecting the global Python installation. You can use the following commands to create and activate a conda environment:
conda create -n pandas_intro python=3.10
conda activate pandas_intro
Pandas is available on PyPI and can be installed using pip:
pip install pandas
Personally, I prefer using pandas in JupyterLab, a web application that allows you to create and run Jupyter notebooks.
JupyterLab can be installed from PyPI using pip:
pip install jupyterlab
Or using the conda package:
conda install -c conda-forge jupyterlab
To launch Jupyter Lab, you need to start it from the command line:
jupyter-lab
After starting Jupyter Lab, you can open a new notebook and start writing your code in your web browser.
Importing pandas
A common practice is to import pandas as pd:
import pandas as pd
Pandas data frames
A pandas data frame is a powerful data structure that allows you to work with tabular data. It is similar to a table in a relational database.
Here is an example of a data frame created from a list of rows. The column names are passed separately through the columns parameter:
df = pd.DataFrame([
[1, 2, 3],
[4, 5, 6]
], columns=['A', 'B', 'C'])
The data frame has two rows and three columns.
df.shape # (2, 3)
Here is an example of a data frame created from a dictionary:
df = pd.DataFrame({
'A': [1, 4],
'B': [2, 5],
'C': [3, 6]
})
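A data frame can also be created from an Excel file. By default, read_excel reads only the first sheet and returns a single data frame:
df = pd.read_excel('data.xlsx')
With sheet_name=None, it reads every sheet and returns a dictionary of data frames. The keys of the dictionary are the sheet names.
sheets = pd.read_excel('data.xlsx', sheet_name=None)
sheets['Sheet1']
for name, df_sheet in sheets.items():
    print(name, df_sheet.shape)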
Data frame indexing and slicing
Data frames can be indexed by row and column, either by integer position (iloc) or by label (loc).
df.iloc[0] # row 0
df.iloc[0, 0] # row 0, column 0
df.loc[0] # row 0
df.loc[0, 'A'] # row 0, column A
I personally prefer to use loc because selecting by label is more intuitive.
df.loc[0, 'A'] # row 0, column A
df.loc[0, ['A', 'B']] # row 0, columns A and B
A data frame can be sliced by row and column:
df.iloc[0:2] # rows 0 and 1
df.iloc[0:2, 0:2] # rows 0 and 1, columns 0 and 1
df.loc[0:2] # rows 0, 1 and 2 (loc slices by label and includes the end label)
df.loc[0:2, 'A'] # rows 0, 1 and 2, column A
df.loc[0:2, ['A', 'B']] # rows 0, 1 and 2, columns A and B
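The difference between the two slicing styles is easy to miss, so here is a small sketch of it. The frame and its values are made up for illustration and assume the default integer index.

```python
import pandas as pd

# Hypothetical three-row frame with the default index 0, 1, 2
df = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8]})

by_position = df.iloc[0:2]  # positional slice: the end (2) is excluded
by_label = df.loc[0:2]      # label slice: the end label (2) is included

print(len(by_position), len(by_label))  # 2 3
```

With a non-integer index the same rule applies: loc takes the labels themselves and keeps both ends of the slice.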
Data frame basic operations
In this section, we will see how to perform basic operations on data frames.
Adding rows and columns
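To add a new row at the end of the data frame, you can wrap the row in a one-row data frame and use pd.concat (the older append method was removed in pandas 2.0):
df = pd.concat([df, pd.DataFrame([{
'A': 7,
'B': 8,
'C': 9
}])], ignore_index=True) # ignore_index=True to reset the index
To add a new column, you can use the assign method:
df = df.assign(D=[10, 11, 12])
To add a new column at a specific position, you can use the insert method. It modifies the data frame in place and returns None, so the result must not be assigned back to df:
df.insert(0, 'E', [13, 14, 15]) # insert column E at position 0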
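Here is a sketch of the same three operations run end to end on a recent pandas release. Since pandas 2.0 removed DataFrame.append, pd.concat stands in for it here. The starting frame is the dictionary example from earlier in these notes.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4], 'B': [2, 5], 'C': [3, 6]})

# Add a row: wrap it in a one-row frame and concatenate.
new_row = pd.DataFrame([{'A': 7, 'B': 8, 'C': 9}])
df = pd.concat([df, new_row], ignore_index=True)

# assign returns a new frame with the extra column at the end.
df = df.assign(D=[10, 11, 12])

# insert changes the frame in place and returns None, so there is no reassignment.
df.insert(0, 'E', [13, 14, 15])

print(list(df.columns))  # ['E', 'A', 'B', 'C', 'D']
print(df.shape)          # (3, 5)
```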
Deleting rows and columns
The drop method can be used to delete rows and columns. It returns a new data frame and leaves the original unchanged:
df.drop(0) # delete row 0
df.drop(columns=['A']) # delete column A
Sometimes you may want to delete rows and columns that contain missing values.
df.dropna(axis=0, how='any') # delete rows that contain missing values
df.dropna(axis=1, how='any') # delete columns that contain missing values
To drop a column that contains only missing values:
df.dropna(axis=1, how='all') # delete columns in which every value is missing
To drop duplicate rows:
df.drop_duplicates(keep='first', inplace=True) # delete duplicate rows except the first one (inplace=True to modify the data frame)
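To see how the how parameter changes the result, here is a minimal sketch on a made-up frame in which column B is entirely empty and column A is only partly empty.

```python
import numpy as np
import pandas as pd

# Hypothetical data: A has one missing value, B has only missing values
df = pd.DataFrame({
    'A': [1.0, np.nan, 1.0],
    'B': [np.nan, np.nan, np.nan],
    'C': [3, 4, 3],
})

no_empty_cols = df.dropna(axis=1, how='all')  # drops B only
complete_cols = df.dropna(axis=1, how='any')  # drops A and B
deduped = no_empty_cols.drop_duplicates(keep='first')  # last row repeats the first

print(list(no_empty_cols.columns))  # ['A', 'C']
print(list(complete_cols.columns))  # ['C']
print(len(deduped))                 # 2
print(df.shape)                     # (3, 3), the original is untouched
```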
Editing data
To edit a single cell, you can use the at accessor:
df.at[0, 'A'] = 0 # replace data in row 0, column A with 0
Replacing data
To replace data in a data frame, you can use the replace method:
df.replace(to_replace=1, value=0) # replace 1 with 0
Another way to replace data in a column is the map method. Note that values missing from the mapping become NaN:
df['A'].map({1: 0, 4: 7}) # replace 1 with 0 and 4 with 7; any other value becomes NaN
To replace text inside the strings of a column, you can use the str.replace method:
df['A'] = df['A'].str.replace('A', 'B') # replace every occurrence of A with B
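One thing that caught me out: map and replace treat values without a mapping differently. A short sketch on a made-up series:

```python
import pandas as pd

s = pd.Series([1, 4, 9])  # hypothetical values; 9 has no mapping

mapped = s.map({1: 0, 4: 7})        # 9 is not in the dict, so it becomes NaN
replaced = s.replace({1: 0, 4: 7})  # 9 is left as it is

print(mapped.tolist())    # [0.0, 7.0, nan]
print(replaced.tolist())  # [0, 7, 9]
```

So map suits a complete lookup table, while replace is the safer choice when only some values should change.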
String operations
For string operations, you can use the methods of the str accessor. Each line below is an independent example:
df['A'] = df['A'].str.lower() # convert to lowercase
df['A'] = df['A'].str.upper() # convert to uppercase
df['A'] = df['A'].str.strip() # remove leading and trailing whitespace
df['A'] = df['A'].str.split(' ') # split by space (each cell becomes a list of strings)
df['A'] = df['A'].str.replace(' ', '_') # replace space with underscore
df['A'] = df['A'].str.capitalize() # capitalize the first letter and lowercase the rest
df['A'] = df['A'].str.replace(r'(\d+)', r'(\1)', regex=True) # wrap each run of digits in parentheses using regex
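The str methods return a new series, so they can be chained. A minimal sketch on made-up strings:

```python
import pandas as pd

names = pd.Series(['  ada lovelace ', 'room 42'])  # hypothetical values

cleaned = (
    names.str.strip()                                   # trim outer whitespace
         .str.replace(' ', '_', regex=False)            # literal replacement
         .str.replace(r'(\d+)', r'(\1)', regex=True)    # wrap digits in parentheses
)

print(cleaned.tolist())  # ['ada_lovelace', 'room_(42)']
```

Passing regex explicitly avoids depending on the default, which has changed between pandas versions.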
Filtering rows and columns
A data frame can be filtered by a condition expressed as a boolean expression (a mask).
mask = df['A'] > 1
masked_df = df[mask]
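Masks can be combined, and loc accepts a mask together with a list of columns. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
mask = (df['A'] > 1) & (df['B'] < 8)

rows = df[mask]                    # filter rows only
subset = df.loc[mask, ['A', 'C']]  # filter rows and pick columns in one step

print(subset.values.tolist())  # [[4, 6]]
```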
Data frame sorting
df.sort_values(by='A') # sort by column A
df.sort_values(by=['A', 'B']) # sort by columns A and B
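sort_values returns a new data frame and keeps the original row labels. As a sketch with made-up data, this sorts by A ascending and breaks ties by B descending:

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 1, 2], 'B': [5, 9, 3]})  # hypothetical values

ordered = df.sort_values(by=['A', 'B'], ascending=[True, False])

print(ordered.index.tolist())   # [1, 0, 2], original labels travel with the rows
print(ordered.values.tolist())  # [[1, 9], [2, 5], [2, 3]]

# reset_index(drop=True) renumbers the rows from 0 if the old labels are not needed.
renumbered = ordered.reset_index(drop=True)
```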
Data frame merging
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df1.merge(df2, on='A') # merge on column A; the overlapping column B becomes B_x and B_y
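The how and suffixes parameters control which rows are kept and how overlapping column names are labelled. A minimal sketch, using two made-up frames whose keys only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
right = pd.DataFrame({'A': [2, 3, 4], 'B': [50, 60, 70]})

# Default inner join: only keys present in both frames survive.
inner = left.merge(right, on='A', suffixes=('_left', '_right'))

# Outer join: keep every key, filling the gaps with NaN.
outer = left.merge(right, on='A', how='outer')

print(list(inner.columns))  # ['A', 'B_left', 'B_right']
print(inner['A'].tolist())  # [2, 3]
print(len(outer))           # 4
```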
Data frame aggregation
The groupby method can be used to group data by a column. In this example, the data is grouped by column A and the mean of each remaining column is calculated:
df.groupby('A').mean() # group by column A and compute the mean of the other columns
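When different columns need different aggregations, agg with named outputs is handy. A sketch on made-up data (the output names b_mean and c_sum are my own choice):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['x', 'x', 'y'],
    'B': [1, 3, 10],
    'C': [2, 4, 6],
})

means = df.groupby('A').mean()  # mean of B and C per group

# Named aggregation: output_name=(column, function)
summary = df.groupby('A').agg(b_mean=('B', 'mean'), c_sum=('C', 'sum'))

print(means.loc['x', 'B'])        # 2.0
print(summary.loc['x', 'c_sum'])  # 6
```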
Data frame concatenation
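pd.concat stacks data frames on top of each other. Columns are matched by name, and a column that is missing from one of the frames is filled with NaN, so concatenation works best when the data frames share the same columns.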
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
pd.concat([df1, df2]) # concatenate df1 and df2
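Two details are worth a quick sketch: what happens to the index, and what pd.concat does when the column sets differ. The frames below are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5], 'C': [6]})  # has C instead of B

kept_index = pd.concat([df1, df2])                  # row labels are kept as they were
stacked = pd.concat([df1, df2], ignore_index=True)  # row labels renumbered from 0

print(kept_index.index.tolist())   # [0, 1, 0]
print(stacked.index.tolist())      # [0, 1, 2]
print(list(stacked.columns))       # ['A', 'B', 'C']
print(stacked['C'].isna().sum())   # 2, the rows from df1 have no C
```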
Data frame statistics
Statistics
The describe method can be used to calculate basic statistics:
df.describe()
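For numeric columns, describe returns one row per statistic. A short sketch on made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})

stats = df.describe()

print(stats.index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(stats.loc['mean', 'A'])  # 2.5
print(stats.loc['max', 'B'])   # 40.0
```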
Correlation
Correlation measures the strength of the linear relationship between two variables. Intuitively, it tells you how closely one variable moves with the other, on a scale from -1 to 1.
The corr method returns a matrix with the correlation (Pearson by default) between every pair of columns:
df.corr()
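To make the scale concrete, here is a sketch with made-up columns where B rises exactly with A and C falls exactly as A rises:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [2, 4, 6, 8],  # perfectly increasing with A
    'C': [8, 6, 4, 2],  # perfectly decreasing with A
})

matrix = df.corr()               # pairwise correlation matrix
single = df['A'].corr(df['B'])   # correlation between just two columns

print(round(matrix.loc['A', 'B'], 6))  # 1.0
print(round(matrix.loc['A', 'C'], 6))  # -1.0
```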
Data frame plotting
Pandas uses matplotlib to plot data frames. Pandas data frames can be plotted using the plot method:
df.plot(kind='scatter', x='A', y='B')
Data frame conversion
A data frame can be converted to a multitude of formats. Personally I convert data frames to dictionaries and lists.
Converting to a dictionary
To convert a data frame to a dictionary, you can use the to_dict method:
df.to_dict()
Exporting to a json file
To export a data frame to a json file, you can use the to_json method:
with open('data.json', 'w') as f:
    df.to_json(f)
If the data frame contains non-ASCII characters, you can set the force_ascii parameter to False to write them unescaped:
with open('data.json', 'w', encoding='utf-8') as f:
    df.to_json(f, force_ascii=False)
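The pandas parameter that controls escaping is force_ascii (ensure_ascii belongs to Python's json module). Called without a path, to_json returns a string, which makes the effect easy to check. The city name is just sample data.

```python
import json

import pandas as pd

df = pd.DataFrame({'city': ['Zürich']})  # hypothetical non-ASCII value

escaped = df.to_json()                   # default: non-ASCII characters are escaped
raw = df.to_json(force_ascii=False)      # characters are written as they are

print('Zürich' in escaped)  # False
print('Zürich' in raw)      # True

# Both strings decode to the same data.
print(json.loads(escaped) == json.loads(raw))  # True
```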
Converting to a list of dictionaries
To convert a data frame to a list of dictionaries, you can use the to_dict method, setting the orient parameter to records:
df.to_dict(orient='records')
Converting to a list of lists
To convert a data frame to a list of lists, you can use the values attribute:
df.values.tolist()
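For reference, here is what each conversion produces on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4], 'B': [2, 5]})

as_dict = df.to_dict()                   # column -> {row label -> value}
records = df.to_dict(orient='records')   # one dictionary per row
columns = df.to_dict(orient='list')      # column -> list of values
rows = df.values.tolist()                # one list per row, no column names

print(as_dict)  # {'A': {0: 1, 1: 4}, 'B': {0: 2, 1: 5}}
print(records)  # [{'A': 1, 'B': 2}, {'A': 4, 'B': 5}]
print(columns)  # {'A': [1, 4], 'B': [2, 5]}
print(rows)     # [[1, 2], [4, 5]]
```

The records form can be passed straight back to pd.DataFrame to rebuild the frame.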